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Abstract 


The  Fast  Johnson-Lindenstrauss  Transform  (FJLT)  was  recently  discovered  by  Ailon  and 
Chazelle  as  a  novel  technique  for  performing  fast  dimension  reduction  with  small  distortion 
from  £2  to  1 2  in  time  0(max{dlog  d,  k3}).  For  k  in  [fi(log  d),  0{d1^2)]  this  beats  time  0{dk) 
achieved  by  naive  multiplication  by  random  dense  matrices,  an  approach  followed  by  several 
authors  as  a  variant  of  the  seminal  result  by  Johnson  and  Lindenstrauss  (JL)  from  the  mid 
80’s.  In  this  work  we  show  how  to  significantly  improve  the  running  time  to  0{d  log  k)  for 
k  =  0(d1/2_<5),  for  any  arbitrary  small  fixed  S.  This  beats  the  better  of  FJLT  and  JL.  Our 
analysis  uses  a  powerful  measure  concentration  bound  due  to  Talagrancl  applied  to  Rademacher 
series  in  Banach  spaces  (sums  of  vectors  in  Banach  spaces  with  random  signs).  The  set  of  vectors 
used  is  a  real  embedding  of  dual  BCH  code  vectors  over  GF( 2).  We  also  discuss  the  number  of 
random  bits  used  and  reduction  to  £\  space. 

The  connection  between  geometry  and  discrete  coding  theory  discussed  here  is  interesting 
in  its  own  right  and  may  be  useful  in  other  algorithmic  applications  as  well. 


1  Introduction 

Applying  random  matrices  is  by  now  a  well  known  technique  for  reducing  dimensionality  of  vectors 
in  Euclidean  space  while  preserving  certain  properties  (most  notably  distance  information).  Begin¬ 
ning  with  the  classic  work  of  Johnson  and  Lindenstrauss  [16]  who  used  projections  onto  random 
subspaces,  other  variants  of  the  technique  using  different  distributions  are  known  [1,  9,  15,  22]  and 
have  been  used  in  many  algorithms  [18,  20,  3,  13,  26,  24,  11]. 

In  all  the  variants  of  this  idea,  a  fixed  unit  length  vector  x  G  Rd  is  mapped  onto  Rfc  (k  <  d ) 
via  a  random  linear  mapping  $  from  a  carefully  chosen  distribution.  A  measure  concentration 
principle  is  used  to  show  that  the  distribution  of  the  norm  estimator  error  1 1 1  <L>a;  1 1 2  —  1|  in  a  small 
neighborhood  of  0  is  dominated  by  a  gaussian  of  standard  deviation  ^(At1/2),  uniformly  for  all  x 
and  independent  on  d.  The  distribution  of  $  need  not  even  be  rotationally  invariant.  When  used 
in  an  algorithm,  k  is  often  chosen  as  0(e“2logn)  so  that  a  union  bound  ensures  that  the  error  is 
smaller  than  a  fixed  e  simultaneously  for  all  n  vectors  in  some  fixed  input  set.  Noga  Alon  proved 
[2]  that  this  choice  of  k  is  essentially  optimal  and  cannot  be  significantly  reduced. 

It  makes  sense  to  abstract  the  definition  of  a  distribution  of  mappings  that  can  be  used  for  dimen¬ 
sion  reduction  in  the  above  sense.  We  will  say  that  such  a  mapping  has  the  Johnson-Lindenstrauss 
property  (JLP),  named  after  the  authors  of  the  first  such  construction  (we  make  an  exact  definition 
of  this  property  in  Section  2).  In  view  of  Ailon  and  Chazelle’s  FJLT  result  [1],  it  is  natural  to  ask 
about  the  computational  complexity  of  applying  a  mapping  drawn  from  a  JLP  distribution  on  a 
vector.  The  resources  considered  here  are  time  and  randomness.  Ailon  et  al  showed  that  reduction 
from  d  dimensions  to  k  dimensions  can  be  performed  in  time  0(max{dlog<i},  A;3),  beating  the  naive 
0(kd )  time  implementation  of  JL  for  k  in  cj(logd)  and  o(d1/2).  Similar  bounds  were  found  in  [1] 
for  reducing  onto  i\  (Manhattan)  space,  but  with  quadratic  (not  cubic)  dependence  on  k.  From 
recent  work  by  Matousek  [22]  it  can  be  shown,  by  replacing  gaussian  distributions  with  ±l’s,  that 
Ailon  and  Chazelle’s  algorithm  for  the  Euclidean  case  requires  0(max{d,  A;3})  random  bits  in  the 
Euclidean  case. 

1.1  Our  Results 

This  work  contains  several  contributions.  We  summarize  them  for  the  Euclidean  case  in  Table  1.1 
for  convenience.  The  first  (in  Section  7)  is  a  simple  trick  that  can  be  used  to  reduce  the  running 
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time  of  FJLT  [1]  to  0(max{dlog  k},  fc3),  hence  making  it  better  than  the  naive  algorithm  for  small 
k  (first  column  in  the  table).  In  typical  applications,  the  running  time  translates  to  O(dloglogn), 
where  n  is  the  number  of  points  we  simultaneously  want  to  reduce  (assuming  n  =  2°^1/3^). 


k  in  o(log  d) 

k  in  <x(log  d) 
and  o(poly(d)) 

k  in  fl(poly(d)) 
and  o((dlogd)1//3) 

k  in  lo(( dlogd )3/3) 
and  0(d 1/2^'5) 

Fast 

This  work 

This  work 

This  work,  FJLT 

This  work 

JL 

FJLT 

FJLT 

Slow 

FJLT 

JL 

JL 

JL 

Table  1:  Schematic  comparison  of  asymptotical  running  time  of  this  work,  Ailon  and  Chazelle’s 
work  [1]  (FJLT)  and  a  naive  implementation  of  Johnson-Lindenstrauss  (JL),  or  variants  thereof. 

The  main  contribution  (Sections  5-6)  is  improving  the  case  of ’’large  k ”  (rightmost  column  in  the 
Table  1.1).  We  need  tools  from  the  theory  of  probability  and  norm  interpolation  in  Banach  spaces 
(Section  3)  as  well  as  the  theory  of  error  correcting  codes  (Section  4)  to  construct  a  distribution 
on  matrices  satisfying  JLP  that  can  be  applied  in  time  O(dlogd)  (note  that  in  this  case  log d  = 
0(log/c)).  The  ideas  used  in  our  constructions  are  interesting  in  their  own  right,  because  they  take 
advantage  of  advanced  ideas  from  different  classic  theories.  These  ideas  provide  a  new  algorithmic 
application  of  error  correcting  codes,  an  extremely  useful  tool  in  theoretical  computer  science  with 
applications  in  both  complexity  and  algorithms  (a  good  overview  can  be  found  here  [25],  a  more 
recent  example  here  [17]). 

A  note  on  ’’large  kv:  As  stated  above,  k  is  typically  0(e^2logn),  where  e  is  a  desired  distortion 
bound  and  n  is  the  number  of  vectors  we  wish  to  reduce.  Although  logn  is  typically  small  (loga¬ 
rithmic  in  the  input  size),  in  various  applications,  especially  in  scientific  computation,  e~2  may  be 
large.  This  case  is  therefore  important  to  study. 

It  is  illustrative  to  point  out  an  apparent  weakness  in  [1]  which  was  a  starting  point  of  our 
work.  The  main  tool  used  there  is  to  multiply  the  input  vector  x  by  a  random  sign  change  matrix 
followed  by  a  Fourier  transform,  resulting  in  a  vector  y.  It  is  claimed  that  ||y||oo  is  small  (in  other 
words,  the  ’’information”  is  spread  out  evenly  among  the  coordinates).  By  a  convexity  argument 
the  ’’worst  case”  y  assuming  only  the  £oo  bound  is  a  uniformly  supported  vector,  namely,  a  vector 
in  which  the  absolute  value  of  the  coordinates  in  its  (small)  support  are  all  equal.  Intuitively,  such 
a  vector  is  extremely  unlikely.  In  this  work  we  consider  other  norms. 

It  is  likely  that  the  techniques  we  develop  here  can  be  used  in  conjunction  with  very  recent 
research  on  explicit  embeddings  of  £2  in  £\  [23,  14,  4]  as  well  as  research  on  fast  approximate  linear 
algebraic  scientific  computation  [24,  10,  6,  7,  8,  27]. 

2  Preliminaries 

We  use  £p  to  denote  d  dimensional  real  space  equipped  with  the  norm  ||x||  =  ||x||p  =  (Y^'!=\ 
where  1  <  p  <  00  and  Hx’Hoo  =  max{|xj|}.  The  dual  norm  index  q  is  defined  by  the  solution  to 
1/q  +  1/p  =  1.  We  remind  the  reader  that  ||x||p  =  sup  yeed  xTy  .  For  a  real  k  x  d  matrix  A,  the 
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matrix  norm  ||^4||pi^p  is  defined  as  the  operator  norm  of  A  :  t^)1  — ►  £k  or: 

||A||p1_>p  =  sup  ||At||p  =  sup  sup  yT Ax  . 

xee^  yetq  xee^ 

||arl]=i  IMI=i  ||xpl 

In  what  follows  we  use  d  to  denote  the  original  dimension  and  k  <  d  the  target  (reduced) 
dimension.  The  input  vector  will  be  x  =  (x\, . . . ,  Xd)T  £  £%■  Since  we  only  consider  linear  reductions 
we  will  assume  without  loss  of  generality  that  ||.x||2  =  1. 

Definition  2.1  A  family  of  distributions  V(d,k)  on  k  x  d  real  matrices  (k  <  d)  has  the  Johnson- 
Lindenstrauss  property  (JLP)  with  respect  to  a  norm  index  p  if  for  any  unit  vector  x  £  l\  and 
0  <  £  <  1/2, 

Pr  [1  —  £  <  II ArlL  <  1  +  e]  >  1  —  c\e~C2ke2  (1) 

for  some  global  ci,C2  >  0. 

(A  similar  definition  was  given  in  [24].)  In  this  work,  we  study  the  cases  p  =  1  ( Manhattan  JLP) 
and  p  =  2  ( Euclidean  JLP).  We  make  a  few  technical  remarks  about  Definition  2.1: 

•  For  most  dimension  reduction  applications  k  =  fl(£-2),  so  the  constant  ci  can  be  ” swallowed” 
by  C2,  but  we  prefer  to  keep  it  here  to  avoid  writing  0(e~^k£  -1)  and  for  generality. 

•  The  definition  is  robust  with  respect  to  bias  of  0(A:-1/2).  More  precisely,  if  we  prove  PrfjU— e  < 
||.Ax’||p  <  p  +  e]  >  1  —  cie~C2ks  for  some  p  satisfying  \p  —  1|  =  0{k~l^2):  then  this  would 
imply  (1),  with  possibly  different  constants.  We  will  use  this  observation  in  what  follows. 

Recall  that  the  Walsh-Hadamard  matrix  Hd  is  a  d  x  d  orthogonal  matrix  with  Hd(i,j )  = 
2~ rf/2( — i) <*D)  for  i  j  £  [0,  d—  1],  where  (i.  j)  is  dot  product  (over  IF2)  of  i,j  viewed  as  (logd)-bit 
vectors.  The  matrix  encodes  the  Fourier  transform  over  the  binary  hypercube.  It  is  well  known 
that  x  e->  H^x  £  £2  can  be  computed  in  time  O(dlogd)  for  any  x  £  l^-,  and  that  the  mapping  is 
isomorphic. 

Definition  2.2  A  matrix  A  £  Rmxd  is  a  code  matrix  if  every  row  of  A  is  equal  to  some  row  of  H 
multiplied  by  y/d/m. 

The  normalization  is  chosen  so  that  columns  have  Euclidean  norm  1. 


2.1  Statement  of  our  Theorems 

The  main  contribution  is  in  Theorem  2.4  below. 

Theorem  2.3  For  any  code  matrix  A  of  size  kxd  for  k  <  d,  the  mapping  x  Ax  can  be  computed 
in  time  0(d  log  k). 

Clearly  this  theorem  is  interesting  only  for  log  k  =  o(log  d) ,  because  otherwise  the  Walsh-Hadamard 
transform  followed  by  projection  onto  a  subset  of  the  coordinates  can  do  this  in  time  O(dlogd), 
by  definition  of  a  code  matrix.  As  a  simple  corollary,  the  running  time  of  the  algorithms  in  [1]  can 
be  reduced  to  O(max{o?log  k,  k 3}),  because  effectively  what  they  do  is  multiply  the  input  x  (after 
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random  sign  change)  by  a  code  matrix  of  size  0(k 3)  x  d  and  then  manipulate  the  outcome  in  time 
0(k3).  This  gives  the  left  column  of  Table  1.1.  We  omit  the  details  of  this  result  and  refer  the 
reader  to  [1,  22], 

Theorem  2.4  Let  5  >  0  be  some  arbitrarily  small  constant.  For  any  d,k  satisfying  k  <  ci1/2-5 
there  exists  an  algorithm  constructing  a  random  matrix  A  of  size  k  x  d  satisfying  JLP,  such  that 
the  time  to  compute  x  >  Ax  for  any  x  £  R0'  is  O(dlogk).  The  construction  uses  O(d)  random 
bits  and  applies  to  both  the  Euclidean  and  the  Manhattan  cases. 

We  will  prove  a  slightly  weaker  running  time  of  O(dlogd)  below,  and  provide  a  sketch  for 
reducting  it  to  0(d  log  k),  where  the  full  details  or  the  improvement  are  deferred  to  Appendix  A. 
This  improvement  is  interesting  for  small  k,  and  provides  a  unified  solution  for  all  k  <  d1/2-5 , 
though  the  small  k  case  can  also  be  be  taken  care  using  Theorem  2.3  above  in  conjunction  with 
FJLT  [1],  The  main  contribution  of  theorem  2.3,  of  course,  is  in  getting  rid  of  the  term  /c3  in  the 
running  time  of  FJLT. 

3  Tools  from  Banach  Spaces 

The  following  is  known  as  an  interpolation  theorem  in  the  theory  of  Banach  spaces.  For  a  proof, 
refer  to  [5]. 

Theorem  3.1  [Riesz-Thorin]  Let  A  be  an  m  x  d  real  matrix,  and  assume  ||A||Pl^ri  <  C\  and 
H-AHpj-^  <  C*2  for  some  norm  indices  Pi,ri,p2,r2-  Let.  X  be  a  real  number  in  the  interval  [0,1], 
and  let  p,  r  be  such  that  1/p  =  X(l/p\)  +  (1  —  X)(l/p2)  and  1/r  =  A(l/ri)  +  (1  —  A) (1  /?'2) -  Then 
\\A\\p^r<ClCl~\ 

Theorem  3.2  [Hausdorff- Young]  For  norm  index  1  <  p  <  2,  <  d_1/p+1/2,  where  q  is 

the  dual  norm  index  of  p. 

(The  theorem  is  usually  stated  with  respect  to  the  Fourier  operator  for  functions  on  the  real  line 

or  on  the  circle,  and  is  a  simple  application  of  Riesz-Thorin  by  noticing  that  || /Y|| 2 _ >2  =  1  and 

\\H\\1^00  =  d~1/2.) 

Let  M  be  a  real  matrix  mxd  matrix,  and  let  z  £  Rd  be  a  random  vector  with  each  zi  distributed 
uniformly  and  independently  over  {±1}.  The  random  vector  Mz  £  is  known  as  a  Rademacher 
random  variable.  A  nice  exposition  of  concentration  bounds  for  Rademacher  variables  is  provided 
in  Chapter  4.7  of  [19]  for  more  general  Banach  spaces.  For  our  purposes,  it  suffices  to  review  the 
result  for  finite  dimensional  £p  space.  Consider  the  norm  Z  =  ||Mz||p  (we  say  that  ” Z  is  the  norm 
of  a  Rademacher  random  variable  in  corresponding  to  M”).  We  associate  two  numbers  with  Z, 

•  The  deviation  a ,  defined  as  ||M||2->P,  and 

•  a  median  p  of  Z. 

Theorem  3.3  For  any  t  >  0,  Pr[|Z  —  p\  >  t]  <  4e~t2 . 

The  theorem  is  a  simple  consequence  of  a  powerful  theorem  of  Talagrand  (Chapter  1,  [19]) 
on  measure  concentration  of  functions  on  {— l,+l}d  extendable  to  convex  functions  on  i with 
bounded  Lipschitz  norm. 


4 


4  Tools  from  Error  Correcting  Codes 

Let  A  be  a  code  matrix,  as  defined  above.  The  columns  of  A  can  be  viewed  as  vectors  over  F2  under 
the  usual  transformation  ((+)  — >  0,  (— )  — >  1).  Clearly,  the  set  of  vectors  thus  obtained  are  closed 
under  addition,  and  hence  constitute  a  linear  subspace  of  F™.  Conversely,  any  linear  subspace  V  of 
F™  of  dimension  v  can  be  encoded  as  an  m  x  2V  code  matrix  (by  choosing  some  ordered  basis  of  V). 
We  will  borrow  well  known  constructions  of  subspaces  from  coding  theory,  hence  the  terminology. 
Incidentally,  note  that  Hd  encodes  the  Hadanrard  code,  equivalent  to  a  dual  BCH  code  of  designed 
distance  3. 

Definition  A  code  matrix  A  of  size  m  x  d  is  a-wise  independent  if  for  each  1  <  i\  <  i2  <  •  •  •  <  ia  < 
m  and  (b\,  62,  •  •  • ,  ba)  G  {+1,  —1}“,  the  number  of  columns  A (J)  for  which  (A^,  A^\  . . . ,  A^J)  = 
rri  -1/2(h,b2,,,.,ba)  is  exactly  d/2a. 

Lemma  4.1  There  exists  a  4-wise  independent  code  matrix  of  size  kx  f  BCH^  >  w^ere  f BCH (^)  = 
(-)(/.-). 

The  family  of  matrices  is  known  as  binary  dual  BCH  codes  of  designed  distance  5.  Details  of  the 
construction  can  be  found  in  [21]. 


5  Reducing  to  Euclidean  Space  for  k  <  d1'2-6 


Assume  5  >  0  is  some  arbitrarily  small  constant.  Let  B  be  a  k  x  d  matrix  with  Euclidean  unit 
length  columns,  and  D  a  random  {±1}  diagonal  matrix.  Let  Y  =  U-BD2H2.  Our  goal  is  to  get  a 
concentration  bound  of  Y  around  1.  Notice  that  E[Y2]  =  1.  In  order  to  use  Theorem  3.3,  we  let 
M  denote  the  k  x  d  matrix  with  the  Tth  column  being  XiB^l\  where  denotes  the  i’th 
column  of  B.  Clearly  Y  is  the  norm  of  a  Rademacher  random  variable  in  corresponding  to  M. 
We  estimate  the  deviation  a  and  median  //,  as  defined  in  Section  3. 


a  =  ||M||2_2  =  sup  '\tjM  1 2 

yee% 

M=i 

/  d  \  V2 

=  sup  Ij^x^B^)2) 


\i=  1 


<  1 1  1 1 4  SUp 


=  llxIUII  B1 


) 

f  d  \  V4 

E^W)4) 


y 


2^4 


(2) 


(The  inequality  is  Cauchy-Schwartz.)  To  estimate  the  median  //.,  we  compute 

/*oo  COO 

E[(Z  -  p)2]  =  /  Pr[(Z  -  p)2}  >  s]ds  <  /  4 e~s/{&a2)ds  =  32 a2  . 

Jo  Jo 

The  inequality  is  an  application  of  theorem  3.3.  Recall  that  E[Z2]  =  1.  Also,  E[Z]  =  E[V Z2\  < 
yj . E[Z2]  =  1  (by  Jensen).  Hence  E[(Z  —  p)2]  =  E[Z2]  —  2 pE[Z]  +  p2  >  1  —  2p  +  p2  =  (1  —  p)2. 
Combining,  |1  —  p\  <  \J32a.  We  conclude, 
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Corollary  5.1  For  any  t  >  0, 

Pr[| Z  -  1|  >  t]  <  c3exp{-C4t2/(||x||l||5T||^4)}  , 
for  some  global  03,04  >  0. 

In  order  for  the  distribution  of  BD  to  satisfy  JLP,  we  need  to  have  a  =  C^/C1/2).  This 

requires  controlling  both  |£>7  1 1 2 _ >4  and  ||x||4.  We  first  show  how  to  design  a  matrix  B  that  is  both 

efficiently  computable  and  has  a  small  norm.  The  latter  quantity  is  adversarial  and  cannot  be 
directly  contolled,  but  we  are  allowed  to  manipulate  x  by  applying  a  (random)  orthogonal  matrix 
$  without  losing  any  information. 

5.1  Bounding  ||-BT||2^4  Using  BCH  Codes 

Lemma  5.2  Assume  B  is  a  kxd  A-wise  independent  code  matrix.  Then  || Z?2" 1 1 2 _ >4  <  (3 d)1//4fc-1/2. 

Proof  For  y  £  ,  ||y||  =  1, 

\\yTB\\\  =  dEjm[{yTB{j)f] 

k 

=  dk~ 2  E  Ebi  ,b2,b3,b4  [yh  Vi2  Vh  yiMb2hbi\  (3) 

*1, *2, *3, *4=1 

=  dk~2(3\\y\\'2  -  2\\y\\i)  <  3 dk~2  , 

where  61,  b-2,  63,  64  are  random  {+1,  —1}  variables.  We  now  use  the  BCH  codes.  Let  B f.  denote  the 
k  x  /gcfj(^)  matrix  from  the  Lemma  4.1  (we  assume  here  that  k  =  2a  —  1  for  some  integer  a;  This  is 
harmless  because  otherwise  we  can  reduce  onto  some  k'  =  2“  —  1  such  that  k/2  <  k'  <  k  and  pad  the 
output  with  k  —  k!  zeros).  In  order  to  construct  a  matrix  B  of  size  kxd  for  k  <  ci1//2~5,  we  first  make 
sure  that  d  is  divisible  by  /bCh(^)  (by  at  most  multiplying  d  by  a  constant  factor  and  padding 
with  zeros),  and  then  define  B  to  be  ^//bCh(^)  coPies  °f  E k  side  by  side.  Clearly  B  remains 
4-wise  independent.  Note  that  B  may  no  longer  be  a  code  matrix,  but  x  e- >  Bx  is  computable  in 
time  O(dlogk)  by  performing  d//gQjj(/c)  Walsh  transforms  on  blocks  of  size  /bCh(^)- 

5.2  Controlling  ||rc||4  for  k  <  di/2-s 

We  define  a  randomized  orthogonal  transformation  $  that  is  computable  in  0(d  log  d)  time  and 
succeeds  with  probability  1  —  0(e~k )  for  all  k  <  d 1/2~5.  Success  means  that  ||$x||4  =  0(cC1//4). 
(Note:  Both  big-O’s  hide  factors  depending  on  5).  Note  that  this  construction  gives  a  running  time 
of  O(dlogd).  We  discuss  later  how  to  do  this  for  arbitrarily  small  k  with  running  time  O(dlogk). 

The  basic  building  block  is  the  product  HD' ,  where  H  =  H f  is  the  Walsh-Hadamard  matrix 
and  D'  is  a  diagonal  matrix  with  random  i.i.d.  uniform  {±1}  on  the  diagonal.  Note  that  this 
random  transormation  was  the  main  ingredient  in  [1].  Let  denote  the  Pth  column  of  H. 

We  are  interested  in  the  random  variable  X  =  HHZTx’l^.  We  define  M  as  the  d  x  d  matrix 
with  the  Pth  column  being  XiH®,  we  let  p  =  4  (q  =  4/3),  and  notice  that  X  is  the  norm  of 
the  Rademacher  random  variable  in  t\  corresponding  to  M  (using  the  notation  of  Section  3).  We 
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compute  the  deviation  a, 


a  =  \\M\\2^4  =  l!MT||4/3^2 

\  1/2 

J 

II 3/  II 4/3  =  1  (4) 

(  \  1/4 

ii  ii  ii  'T'  1 1 

=  \\x  4 \\H  4.  . 

3 

(Note  that  HT  =  H .)  By  the  HausdorfF- Young  theorem,  ||f/'||4_).4  <  d~ 1//4.  Hence,  a  <  ||x||4d_1//4. 
We  now  get  by  Theorem  3.3  that  for  all  t  >  0, 

Pr[|  \\HD’x\U  -n\  >t}<  4e-t2/(8||x|l'd"1/2)  ,  (5) 


where  p  is  a  median  of  X. 

Claim  5.3  //  =  0(g?-1/4)  . 

Proof  To  see  the  claim,  notice  that  for  each  separate  coordinate  E[(H D1  x)f]  =  0{d~2)  and  then 
use  linearity  of  expectation  to  get  E,[||fYD/x||4]  =  0(d_1).  By  Jensen  inequality,  £,[||iYD/x||4]  < 
E[\\HD'x\\i}b/4  =  0(d~b/A )  for  6=1,2,  3.  Now 

roc  roc 

E[(\\HD'x\\4  -  p)4}  =  /  Fr[(\\HD'x\\4  -  pf  >  s}ds  <  /  4e-sl/2/(8||x||2,i"1/2)d.s 

Jo  Jo 

=  Old-1)  . 

This  implies  that  71  cU1  —  72 d~3^4p  +  73 dr2l4pf  —  7 4d~l^4p2,  +  p4  <  75 d_1,  where  7 j  is  a  global 
constant  for  i  =  1,  2,  3, 4, 5.  Clearly  this  implies  the  statement  of  the  claim. 

Let  eg  be  such  that  p4  <  cgd_1//4.  We  weaken  inequality  (5)  using  the  last  claim  to  obtain  the 
following  convenient  form: 

Pr[\\HD'x\\4  >  c9d~1/4  +  t}<  4e-t2/(8Hxll^“1/2)  .  (6) 

In  order  to  get  a  desired  failure  probability  of  0(e~k )  set  t  =  csA:1^||x’||4(i_1/4.  For  k  <  dlJ2~8 
this  gives  t  <  csd^s^2\\x\\4.  In  other  words,  with  probability  1  —  0(e~k )  we  get 

||.fLD,a:||4  <  cgd-1^4  +  csd~s^2\\x\\4  . 

Now  compose  this  r  times:  Take  independent  random  diagonal  {±1}  matrices  D'  =  D W,  , . . . ,  D ^ 
and  define  *  =  HD^HD^r~4^  ■  •  •  HD^\  Using  a  union  bound  on  the  conditional  failure  prob¬ 
abilities,  we  easily  get: 

Lemma  5.4  \l4  reduction  for  k  <  d1/2-5]  With  probability  1  —  0(e~k ) 

||$(r)x||4  =  0{d~1/4)  (7) 

for  r  =  [1/25] . 
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Note  that  the  constant  hiding  in  the  bound  (7)  is  exponential  in  1/5. 

Combining  the  above,  the  random  transformation  A  =  BD<&^  has  Euclidean  JLP  for  k  < 
d 1/2— and  can  be  applied  to  a  vector  in  time  0(dlogd).  This  proves  the  Euclidean  case  of 
Theorem  2.4. 

5.3  Reducing  the  Running  Time  to  O(dlogk) 

We  explain  how  to  reduce  the  running  time  to  O(dlogk),  using  the  new  tools  developed  here.  This 
provides  a  unified  solution  to  the  problem  of  designing  efficient  Johnson-Lindenstrauss  projections 
for  all  k  up  to  d1^2^8 .  Recall  that  in  the  construction  of  B  we  placed  copies  of  the  same 

code  matrix  B k  of  size  ^x/bCh(^)  s^e  by  side.  It  turns  out  that  we  can  apply  this  ’’decomposition” 
of  coordinates  to  Indeed,  let  Ij  C  [d]  denote  the  j’th  block  of  (3  =  f^Q^(k)k8  consecutive 

coordinates  (assume  that  (3  is  an  integer  that  divides  d).  For  a  vector  y  e  let  yj.  €  Ip  denote  the 

projection  of  y  onto  the  set  of  coordinates  Ij.  Now,  instead  of  using  as  above,  we  use  a 

block-diagonal  d  x  d  matrix  comprised  of  d/(3  (3  x  (3  blocks  each  drawn  from  the  same  distribution 
as  Clearly  the  running  time  of  the  block-diagonal  matrix  is  O(dlogk),  by  applying  the  Walsh 
transform  independently  on  each  block  (recall  that  (3  =  f~BCn{k)k8  =  0(k2+8)). 

In  order  to  see  why  this  still  works,  one  needs  to  repeat  the  above  proofs  using  a  family  of  norms 

II  '  Il(pi,p2)  indexed  by  two  norm  indices  pi,P2  and  defined  as  IMI(Pl,P2)  =  (j2j= i  II^cIIpi)  •  We 
discuss  this  in  detail  in  Appendix  A. 

6  Reducing  to  Manhattan  Space  for  k  <  dl/2-5 

We  sketch  this  simpler  case.  As  we  did  for  the  Euclidean  case,  we  start  by  studying  the  random 
variable  W  6  defined  as  W  =  ||/c1/2RI7x||i  for  B  as  described  in  Section  5  and  D  a  random 
±l-diagonal  matrix.  In  order  to  characterize  the  concentration  of  W  (the  norm  of  a  Rademacher 
r.v.  in  l\)  we  compute  the  deviation  a,  and  estimate  a  median  p.  As  before,  we  set  M  to  be  the 
k  x  d  matrix  with  the  V th  column  being  kd^B^Xi. 


cr  =  sup  \\yTM\\2  =  sup  [  k^2lx2i(yT B^f 


VStoo 

111/ 11  =  1 


t=l 


<  sup^/2||x||4|bTR«||4  =  A;1/2||x||4||Rt| 


1/2 


oo^4 


(8) 


Using  the  tools  developed  in  the  Euclidean  case,  we  can  reduce  ||x||4  to  0(c?_1//4)  with  probability 
1  —  0{e~k)  using  4>r(d),  in  time  O(dlogd)  (in  fact,  O(dlogk)  using  the  improvement  from  Appen- 

dex  A).  Also  we  already  know  from  Section  5.1  that  \\B 1 1 1 2 _ >4  =  O^d/^k^1^2)  if  B  is  comprised 

of  A:  x  /bch(^)  dual  BCH  codes  (of  designed  distance  5)  matrices  side  by  side  (assume  /bCh(^) 
divides  d).  Since  ||y||oo  >  A:_1/2||y||2  for  any  y  6  Ik,  we  conclude  that  ||Rr||00^4  =  0(d1^). 
Combining,  we  get  cr  =  Of/k1^2).  We  now  estimate  the  median  p  of  W . 

In  order  to  calculate  p  we  first  calculate  E(W)  =  A:E/[|P|]  where  P  is  any  single  coordinate  of 
kx!2BDx.  We  follow  (almost  exactly)  a  proof  by  Matousek  in  [22]  where  he  uses  a  quantitative 
version  of  the  Central  Limit  Theorem  by  Konig,  Schiitt,  and  Tomczak  [12]. 


Lemma  6.1  [Konig-Schiitt-Tomczak]  Let  z\  . . .  z(j  be  independent  symmetric  random  variables 
with  1  E[z£\  =  1,  let  F(t )  =  Pi'EiLi  zi  <  and  W(i)  =  ^  fl0 0  e_a;2 /2dx.  Then 


|F(t)  -  <p(t)|  < 


C 

l  +  l^l3 


d 


5>[M3] 


/or  all  t  G  R  and  some  constant  C . 

Clearly  we  can  write  P  =  Yli=izi  where  =  D'ixi  and  each  D’l  is  a  random  ±1.  Note  that 
Eti£[N3]  =  ll  x|||.  Let  /3  be  the  constant  (the  expectation  of  the  absolute  value  of 

a  Gaussian). 


\E[\P\}~(3\  = 

< 

< 


’  —  OO 
roo 


’  —  OO 


|t|dF(t)  -  /  |i|<fy?(i) 

J — OO 

|F(t)  —  ^(t)  |  dt 

/ 


poo  Q 

|X||g  / 


-OO  1  + 


We  claim  that  ||x||3  =  0(/c  1).  To  see  this,  recall  that  1 1 a; 1 1 2  =  1 , 1 1 a; 1 1 4  =  0(d  4/4).  Equiva¬ 
lently,  ||xt||2^2  =  1  and  Wx1  1 1 4/3 _ >2  =  0(d-1^4)-  By  applying  Riesz-Thorin,  we  get  that  ||a:|| 3  = 

\\x7  1 1 3 /2 _ >2  =  0(d-1^6),  hence  Hx’Hg  =  0(d-1/2).  Since  k  =  0(d4/2)  the  claim  is  proved. 

By  Linearity  of  expectation  we  get  E(W)  =  kf3{l  ±  0(k ~4)).  We  now  bound  the  distance  of 
the  median  from  the  expected  value. 


I  E(W)-n\  <  E[\W-ix\\ 

/*oo 

=  /  Pt[\W-/jl\  >  t\dt 


< 


4e-t2/(  8P)df  =  0(fcl/2) 


(we  used  our  estimate  a  =  0(A;1//2)  above.)  We  conclude  that  //  =  k/3(l  +  0(k~1/2)).  This  clearly 
shows  that  (up  to  normalization)  the  random  transformation  BD (where  r  =  [1/5])  has  the 
JL  property  with  respect  to  embedding  into  Manhattan  space.  The  running  time  is  O(dlogd). 


7  Trimmed  Walsh-Hadamard  transform 


We  prove  Theorem  2.3.  For  simplicity,  let  H  =  H It  is  well  known  that  computing  the  Walsh- 
Hadamard  transform  Ex  requires  O(dlogd)  operations.  It  turns  out  that  it  is  possible  to  compute 
PHx  with  O(dlogk)  operation,  as  long  as  the  matrix  P  contains  at  most  k  nonzeros.  This  will 
imply  Theorem  2.3,  because  code  matrices  of  size  k  x  d  are  a  product  of  PH where  P  contains 
k  rows  with  exactly  one  nonzero  in  each  row.  To  see  this  we  remind  the  reader  sthat  the  Walsh- 
Hadamard  matrix  (up  to  normalization)  can  be  recursively  described  as 


Hq  =  Hi®  Hq/ 2 
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(9) 

(10) 


Where  <8>  is  the  Kronecker  product. 

We  define  xi  and  X2  to  be  the  first  and  second  halves  of  x.  Similarly,  we  define  P\  and  P2  as 
the  left  and  right  halves  of  P  respectively. 


PHqx  =  (  Pi  P2  )  (  ^  _Hh‘^  )  (  xo  )  =  Pl^/2(X1  +  X2)  +  P2^/2(X1  -  X2)  (n) 

P\  and  P2  contain  k\  and  k2  nonzeros  respectively,  k\  +  k2  =  k,  giving  the  recurrence  relation 
T(d,k)  =  T(d/2,k\)  +  T(d/2,  k2)  +  d  for  the  running  time.  The  base  cases  are  T(d,  0)  =  0  and 
T(d,  1)  =  d.  We  use  induction  to  show  that  T(d,  k)  <  2dlog(fc  +  1). 

T(d,k)  =  T(d/2,  h)  +  T(d/2,  k2)  +  d 

<  dlog(2(fci  +  l)(k2  +  1)) 

<  dlog((k\  +  k2  +  l)2)  for  any  k\  +  k2  =  k  >  1 

<  2dlog(fc  +  l) 


The  last  sequence  of  inequalities  together  with  the  base  cases  clearly  also  give  an  algorithm  and 
prove  Theorem  2.3. 

Since  in  [1]  both  Hadamard  and  Fourier  transforms  were  considered  we  shortly  describe  also 
a  simple  trimmed  fourier  transform.  In  order  to  compute  k  coefficients  from  a  d  dimensional 
fourier  transform  on  a  vector  x,  we  divide  x  into  L  blocks  of  size  d/L  and  begin  with  the  first 
step  of  the  cooley  tukey  algorithm  which  performs  d/L  FFT’s  of  size  L  between  the  blocks  (and 
multiplies  them  by  twiddle  factors).  In  the  second  step,  instead  of  computing  FFT’s  inside  each 
block,  each  coefficient  is  computed  directly,  by  summation,  inside  it’s  block.  These  two  steps  require 
(d/L)  ■  Llog(L)  and  kd/L  operations  respectively.  By  choosing  k/\og(k )  <  L  <  k  we  achieve  a 
running  time  of  0(dlog(k)). 


8  Future  work 

•  Lower  bounds.  A  lower  on  the  running  time  of  applying  a  random  matrix  with  a  JL  property 
on  a  vector  will  be  extremely  interesting.  Any  nontrivial  (superlinear)  bound  for  the  case 
k  =  will  imply  a  lower  bound  on  the  time  to  compute  the  Fourier  transform,  because 
the  bottleneck  of  our  constructions  is  a  Fourier  transform. 

•  Going  beyond  k  =  d1/2-5.  As  part  of  our  work  in  progress,  we  are  trying  to  push  the  result 

to  higher  values  of  the  target  dimension  k  (the  goal  is  a  running  time  of  O(dlogd)).  We 
conjecture  that  this  is  possible  for  k  =  and  have  partial  results  in  this  direction.  A 

more  ambitious  goal  is  k  =  Ll(d). 
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A  Reducing  the  running  time  to  O(dlogk)  for  k  <  d1/2-5 

Recall  the  construction  in  Section  5:  <5  >  0  is  an  arbitrarily  small  constant,  we  assume  that 
k  <  d1/2-5,  that  k5  is  an  integer  and  that  (3  =  /gQjj(/c)fc<5  divides  d  (all  these  requirements  can  be 
easily  satisfied  by  slightly  reducing  5  and  at  most  doubling  d).  The  matrix  B  is  of  size  k  x  d,  and 
was  defined  as  follows: 

B  =  (Bk  Bk  ■  ■  ■  Bk)  , 

where  Bk  is  the  k  x  /bchW  c°de  matrix  from  Lemma  4.1.  Let  B  denote  ks  copies  of  Bk,  side 
by  side.  So  B  is  of  size  k  x  (5  and  B  consists  of  d//3  copies  of  B.  As  in  Section  5  we  start  our 
construction  by  studying  the  distribution  of  the  I2  estimator  Y  =  ||_BDx||2,  where  D  is  our  usual 
random  ±1  diagonal  matrix.  Going  back  to  (2)  (recall  that  M  is  the  matrix  whose  i’th  column 
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AfW  is  XiB W),  we  recompute  the  deviation  o': 


o-  =  ||M||2-2  =  sup  ||yTAf ||2 
II  i/ll=i 

/■>  X1'2 

=  sup  i^x,2(/B(,))2J 

(d/0 

=  sup  EE  xi(yTB®y 

\j=i  ieij 


1/2 


where  Ij  is  the  j’th  block  of  /3  consecutive  integers  between  1  and  d.  Applying  Cauchy-Schwartz, 
we  get 


(d/p  \  1/2 

a  <  sup  ^  ||x7j||21||yT5||l 

\i=l  /  (12) 

IMI=i  V  7 

=  (sup  ||yTB||4)  II a; II (4,2)  =  ||-BT||2^4||a:||(4,2)  , 
where  ||  •  ||(Pl,P2)  is  defined  by 

(d/0  \  1/p 2 

Ho..*)"  (ElKllgj 


and  x/.  G  £Pl  is  the  projection  of  x  onto  the  set  of  coordinates  Ij.  Our  goal,  as  in  Section  5,  is 
to  get  cr  =  0{k~1/2).  By  the  properties  of  dual  BCH  code  matrices  (Lemma  5.2),  we  readily  have 
that  || -B2" || 2_>4  =  0((/gQjj(/c)L<5)1/4A:_1/2)  which  is  0(k ^Z4)  by  our  construction.  We  now  need  to 
somehow  ’’ensure”  that  ||a?|| (4,2)  =  0(k~1^2 ~5/4)  in  order  to  complete  the  construction. 

As  before,  we  cannot  directly  control  x  (and  its  norms),  but  we  can  multiply  it  by  random 
orthogonal  matrices  without  losing  Z2  information.  Let  H'  be  a  block  diagonal  d  x  d  matrix  with 
d/f3  blocks  of  the  Walsh-Hadamard  matrix  H/3: 

\ 


\  h,3) 

Let  D'  be  a  random  diagonal  d  x  d  matrix  over  ±1.  The  random  matrix  H' D'  is  orthogonal.  We 
study  the  random  variable  X'  =  \\H'  D'x\\u2).  Let  M'  be  the  matrix  with  the  V th  column  M'W 
defined  as  We  notice  that  X'  is  the  norm  of  the  Rademacher  random  variable  in 

corresponding  to  M. 

Remark:  The  results  on  Rademacher  random  variables  presented  in  Section  3  apply  to  ’’non- 
standard”  norms  such  as  ||  ■  ||(Pl,P2)-  The  dual  of  ||  •  ||(Pl,P2)  is  ||  •  ||(<?1,<?2),  where  gi,  g2  are  the  usual  dual 
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norm  indices  of  pi,P2,  respectively.  It  is  an  exercise  to  check  that  ||^||(p1,p2)  =  suP||j/||(J  ,,  )=ixTy- 
We  compute  the  deviation  a'  and  a  median  p!  of  X'  (as  we  did  in  (4)): 


Cf'  —  II M II2— >(4,2)  —  ll^7  II (4/3,2)— >2 


=  sup 


yee 


Yxl(yTH(l)f 


1/2 


(4/3,2)  \  l 

3/11  —  1 


(d/p  \  1/2 

=  sup 

\i=l  i£lj 

(d/p  \  1/2 

<  sup  (  l^ijWlhlnpWl 

(d/p 

<  sup  I  ^  IK'  Wlhlj  \\l/3\\Hp\\l/3^4 

(d/p  \  1/2 

=  I \Hp  1 1 4/3^4  sup  f  ^  11®/,.  Hilly/*  111/3 


1/2 


where  the  first  inequality  is  Cauchy-Schwartz.  By  the  inequality  (]>Y  Aj )1/2  <  ^  •  A^2  holding  for 
all  nonnegative  A4,  A2,  • . . ,  we  get 


d/p 

<  11^114/3^4  sup  hljUWyijh/S  <  11^/3 II 4/3^4 Ik II (4,2)  • 

yee(4/3,2)  j= 1 

llwll=i 


(The  rightmost  inequality  is  from  the  fact  that  1 1 2T,  1 1 4/3  =  ^  and  the  definition  of  |k||(4,2)-) 

By  Hausdorff- Young,  \\Hp\\ 4/3_>4  <  /A1/4  =  0(&_1/2_l5/4),  hence  ex'  =  0(fc_1/2~<5/4|MI(4,2))-  Any 
median  p!  of  X'  is  0(fc_1/2~A4)  (details  omitted).  Applying  Theorem  3.3,  we  get  that  for  all  t>0, 

Pr[A'  >  p  +  t\<  4e_i2/(8<T2)  <  ci  exp{— C2^2 A:1+<5//2/ ||o?| | }  , 

for  some  global  ci,c2  >  0.  Setting  t  =  B(||x||(4)2)A;_<5//4),  we  get  that 

Vr[\\H'D'x\\^2)>p'  +  t]  =  0{e-k)  . 

Similarly  to  the  arguments  leading  to  Lemma  5.4,  and  with  possible  readjustment  of  the  parameter 
5,  we  get  using  a  union  bound 

Lemma  A.l  [f(4j2)  reduction  for  k  <  d1/2-5]  Let  H'  ,D'  be  as  above,  and  let  $>'  =  H'D'.  Define 
to  be  a  composition  of  r  i.i.d.  matrices,  each  drawn  from  the  same  distribution  as  Then 
With  probability  1  —  0(e~k) 

ll$,(r)x||(4,2 )  =  0(fe-1/2-5/4) 

for  r  =  [1/25] . 
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Combining  the  above,  the  random  transformation  A  =  BD has  the  JL  Euclidean  property 
for  k  <  and  can  be  applied  to  a  vector  in  time  O(dlogk),  as  required.  Indeed,  multiplying 

by  <h/  is  done  by  doing  a  Walsh  transform  on  d//3  blocks  of  size  (3  each,  resulting  in  time  0(d  log  k ). 
Clearly  the  number  of  random  bits  used  in  choosing  A  is  0(d). 
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